OncoCast is a computational tool we developed for integrating broad-panel tumor sequencing data with clinical outcome for cancer patient survival prediction. It relies on ensemble learning for outcome predictions using multiple types of data to generate a composite risk score. More specifically OncoCast applies Cox penalized regression methods (LASSO, RIDGE and elastic-net) and generates a cross-validated genomic risk score. This score can be subsequently used to stratify each patient into different risk subsets. The program also generates visualization of feature importance to allow the identification of biomarkers for predicting clinical outcome. A validation module is also included for calculating risk score on validation data sets.
Here we use a simulated data set to illustrate the use of OncoCast. A total of n=300 subjects were simulated. Survival time for the subjects were simulated using exponential model with uniform censoring. A total of 15 features were simulated, five of which associated with outcome and named ‘ImpCov#’. All the other features were named ‘Cov#’. Here is the code used to generate the data:
## SETUP ###
set.seed(3)
ImpCovMin <- 1
ImpCovMax <- 1.5
n <- 300
x.imp <- cbind(rbinom(n, 1, runif(1,0.1,0.5)),rbinom(n, 1, runif(1,0.1,0.5)),
rbinom(n, 1, runif(1,0.1,0.5)),rbinom(n, 1, runif(1,0.1,0.5)),
rbinom(n, 1, runif(1,0.1,0.5)))
coefs <- c(runif(2,ImpCovMin,ImpCovMax),runif(3,-ImpCovMax,-ImpCovMin))
x <- x.imp
mu <- as.vector(coefs %*% t(x))
time <- rexp(n,exp(mu))*12
c <- runif(n,0,as.numeric(quantile(time,0.8)))
status <- ifelse(time < c , 1,0)
time <-pmin(time,c)
numDummy <- 10
DummyCov <- lapply(1:numDummy,function(x){
rbinom(n, 1, runif(1,0.1,0.2))
})
x.dumb <- do.call(cbind, DummyCov)
x <- as.data.frame(cbind(x,x.dumb))
data <- as.data.frame(cbind(time,status,x))
colnames(data) <- c("time","status",paste0("ImpCov",1:ncol(x.imp)),paste0("Cov",(ncol(x.imp)+1):ncol(x)))
rownames(data) <- paste0("Patient",1:n)
survData <- data
devtools::use_data(survData, internal = F,overwrite = T)
The first dataset contains only a time and censoring variable (named ‘survData’), respectively time and status.
### dataset 1
dim(survData)
## [1] 300 17
head(survData[,1:7])
## time status ImpCov1 ImpCov2 ImpCov3 ImpCov4 ImpCov5
## Patient1 9.933107 0 0 1 1 1 1
## Patient2 32.944307 1 0 0 0 0 1
## Patient3 2.264857 1 0 0 0 0 0
## Patient4 3.658523 1 0 0 0 0 0
## Patient5 12.197182 1 0 1 0 1 0
## Patient6 18.444098 0 0 0 0 1 0
There are two main functions contained in the package: OncoCast and getResults_OC. The OncoCast function performs the esemble learning through penalized regression and cross-validation. getResults_OC summarizes the output and enables exploration of the results in multiple graphical interfaces and summary tables.
The first step is to install the package and load it in the environment of your working directory.
library(OncoCast)
The user has access to the simulated datasets in the package. It is important to note that the method does not currently handle missing data. The user needs to perform imputation on their own before using the software.
anyNA(survData)
## [1] FALSE
dim(survData)
## [1] 300 17
The above shows there are 300 patients with 15 features in the simulated data set. We recommend users go through the manual using the ??OncoCast command to see the different options. Multithreading is implemented to speed up computation. The user should check the number of cores available using detectCores() before performing any analysis.
library(doParallel)
detectCores()
## [1] 24
We recommend setting the number of cross-validation to 100 (runs=100) as a minimum. We pipe the results to an R list object called Out.
library(OncoCast)
Out <- OncoCast(data=survData,formula = Surv(time,status)~.,sampling="cv",
cores=2,runs=100,method=c("LASSO","RIDGE"),save=F)
## [1] "Data check performed, ready for analysis."
## [1] "LASSO SELECTED"
## [1] "RIDGE SELECTED"
Once the run completed the user can access the results from the cross validation from each method using the ‘$’ operator.
length(Out$LASSO)
## [1] 100
str(Out$LASSO[[1]])
## List of 4
## $ CI : num 0.749
## $ fit : Named num [1:15] 0.876 0.97 -1.186 -1.235 -0.954 ...
## ..- attr(*, "names")= chr [1:15] "ImpCov1" "ImpCov2" "ImpCov3" "ImpCov4" ...
## $ predicted: Named num [1:300] NA NA NA 0.468 NA ...
## ..- attr(*, "names")= chr [1:300] "Patient1" "Patient2" "Patient3" "Patient4" ...
## $ means : Named num [1:8] 0.145 0.255 0.41 0.375 0.135 0.205 0.16 0.155
## ..- attr(*, "names")= chr [1:8] "ImpCov1" "ImpCov2" "ImpCov3" "ImpCov4" ...
The Out$LASSO object is a list of length 100 (each index represents a split). Each of these objects themselves contains lists containing (as shown above) the method used, the data, the concordance index for that split, the reffited Cox Proportional hazard model using the penalized regression coefficients value found using the training set, and the predicted risk score for all the patients that fell in the testing set for that split.
These results will be further fed in the getResults_OC function that will let us explore the results in more simplistic and efficient manner. This function takes in as arguments :
- VarSelectObject : Which is a list object outputed by the OncoCast function
- numGroups : a numeric argument that will set the number of risk groups to be made (default is 2 and can go up to 4)
- cuts : a numeric vector of length numGroups - 1, will set the cuts (a vector of quantiles) that will be made in the data based on the predicted averaged risk (default the cut will be made at the median, i.e. cuts=c(0.5))
- mut.data : a boolean value, in this case of binary mutational data this option will let the user explore the mutations in the different risk groups generated
We will see how to use this function only with the LASSO output generated above.
lasso.results <- getResults_OC(Out$LASSO,data=survData,numGroups=4,cuts = c(0.25,0.5,0.75),mut.data = T)
The lasso.results object is a list with multiple plots and tables. We start with the observation of the risk score generated, its refit as a predictor in a Cox proportional hazard model, and the cross-validated concordance index distribution.
lasso.results$RiskHistogram
lasso.results$RiskScoreSummary
## Lower 10% 1st Quarter 1st Tertile Median 2nd Tertile
## Risk Score 2.68 4.15 4.56 5.7 6.22
## 3rd Quarter Upper 10%
## Risk Score 6.41 7.94
kable(summary(lasso.results$RiskRefit)$coefficients)
| coef | exp(coef) | se(coef) | z | Pr(>|z|) | |
|---|---|---|---|---|---|
| RiskScore | 0.679119 | 1.972139 | 0.0563721 | 12.04707 | 0 |
lasso.results$ciSummary
## Lower 10% 1st Quarter Median 3rd Quarter Upper 10%
## Concordance Index 0.75 0.76 0.78 0.79 0.81
The histogram shows the distribution of the composite risk score computed for each patients. This was rescaled from 0 to 10 for ease of interpretation. The median concordance index for the OncoCast score for predicting survival is 0.78.
In order to explore feature selection two plots are available, the first is a simple bar plot of the selection frequency of the most frequently selected features. The second has more details, it lets the user interactively explore the selection and coefficients of all the features at once. As the user hovers over the points the name of the gene represented by that point along with the average hazard for a mutation on that gene.
lasso.results$inflPlot
lasso.results$selectInflPlot
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
As expected we observe on the bar plot that the features that were simulated to be associated with the outcome have been selected 100% of the time.
The volcano plot shows the frequency at which each feature was selected in the 100 splits we have generated by the mean penalized coefficients found in the cross-validation. Note that the size and color of the points are linked to the mutation frequency of each feature. We observe now that not only do the important covariates have higher selection frequency but also much larger coefficient values.
As mentioned above the numGroups argument will enable the user to create between 2 to 4 groups that are made by making specified cuts (using the cuts argument) in the averaged predicted risk score. In this example we made four groups with cuts at 25th, 50th and 75th percentile. For example, the first group corresponds to the patients in the lowest quartile of the predicted risk score (0 <= predicted risk score <= 4.15). We can check how distinct the groups are in survival using a Kaplan-Meier plot as shown below.
lasso.results$KM_Plot
Furthermore two other tools contained in the output of this function let us explore the distribution of features per group generated. Note that this is only possible with binary data at the moment. The first plot is simple bar plot of the frequency of feature events for the most frequently selected predictors per group.
lasso.results$mut_Plot
In order to explore the combinations of genes in each risk group, getResults_OC function returns a second interactive plot. It is a multiple pie chart plot where each pie represents a risk group (in increasing order of risk). As the user hovers over different slices of the pies the name of the mutated genes and gene combinations in that profile along with the number of patients showing that profile and the proportion of patients in that risk group they represent.
To ensure that the number different possible profiles is not too high and ensure that the plot remains readable we limit to the use of the top five most frequently selected genes. The user can also specify the features to be used to generate the pie chart by inputting a character vector containing the exact names of the features into the geneList argument (which is defaulted to be the top 5 most frequently selected genes).
lasso.results$PieChart
As an example, if the user hovers on the pie chart for the high risk group (far right) and on the dark red slice, it shows that 21 patients (or 28%) in the high risk group have the ImpCov2 and ImpCov3 combination.
The last available function in the package enables the user to input a set of new patient data with the same set of features in the original patient set used for training the risk model and retrieve the predicted genetic risk score for the new subjects.
set.seed(3)
new.data <- as.data.frame(matrix(rbinom(10*100,1,0.5),nrow=100,ncol = 10))
colnames(new.data) <- c("ImpCov1","ImpCov2","ImpCov3","ImpCov4","ImpCov5","Cov6","Cov7",
"Cov8","Cov9","Cov10")
rownames(new.data) <- paste0("Incoming",1:100)
Incoming <- predictIncoming(Out$LASSO,new.data,surv.print = c(5,10,15),riskRefit = lasso.results$RiskRefit)
Here are the different objects outputted by this function, first a data frame of the new patients data with an added column containing the predicted genetic risk score for those patients.
head(Incoming$data.out)
## ImpCov1 ImpCov2 ImpCov3 ImpCov4 ImpCov5 Cov6 Cov7 Cov8 Cov9
## Incoming1 0 1 1 0 0 0 0 0 1
## Incoming2 1 1 1 0 1 0 1 1 1
## Incoming3 0 0 0 1 1 1 1 1 1
## Incoming4 0 1 0 1 1 0 1 1 0
## Incoming5 1 1 1 1 1 1 0 1 1
## Incoming6 1 0 0 1 0 1 0 1 1
## Cov10 OncoCastRiskScore
## Incoming1 1 6.535052
## Incoming2 0 6.010438
## Incoming3 0 1.679513
## Incoming4 1 3.454609
## Incoming5 1 4.496586
## Incoming6 0 6.059960
Second is a histogram showing the distribution of those predicted genetic risk scores.
Incoming$RiskHist
An interactive survival plot of individual patients selected using the surv.print function. As the user hovers over the different points extra information will be displayed.
Incoming$IncKM
Note that the user can select which predicted survival curves will be displayed in the above graph using the surv.print argument.
The OncoCast function also allows the user to perform a left-truncated survival analysis. A simple change in the formula argument will suffice to do this. Taking a simple example with the survData.LT data included in the package, the time and status variables are named time1 (for the late-entry time), time2 (for the survival time) and status (for the censoring). An example of such data can be found in the ‘survData.LT’ set.
head(survData.LT[,1:5])
## time1 time2 status ImpCov1 ImpCov2
## Patient 1 0.02966663 3.0503361 1 0 1
## Patient 2 16.84681767 50.0844402 1 1 0
## Patient 3 0.24815977 0.9974178 1 1 0
## Patient 4 37.88561581 45.7715850 1 0 0
## Patient 5 7.53586627 100.0087122 0 0 0
## Patient 6 0.13734736 1.7932666 1 1 1
The user simply needs to use the argument : formula = Surv(time1,time2,status~.)